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ABSTRACT 

Astronomy depends on ever increasing computing power. Processor clock-rates have 
plateaued, and increased performance is now appearing in the form of additional pro- 
cessor cores on a single chip. This poses significant challenges to the astronomy soft- 
ware community. Graphics Processing Units (GPUs), now capable of general-purpose 
computation, exemplify both the difficult learning-curve and the significant speedups 
exhibited by massively-parallel hardware architectures. We present a generalised ap- 
proach to tackling this paradigm shift, based on the analysis of algorithms. We describe 
a small collection of foundation algorithms relevant to astronomy and explain how they 
may be used to ease the transition to massively-parallel computing architectures. We 
demonstrate the effectiveness of our approach by applying it to four well-known astron- 
omy problems: Hogbom CLEAN, inverse ray-shooting for gravitational lensing, pulsar 
dedispersion and volume rendering. Algorithms with well-defined memory access pat- 
terns and high arithmetic intensity stand to receive the greatest performance boost 
from massively-parallel architectures, while those that involve a significant amount of 
decision-making may struggle to take advantage of the available processing power. 

Key words: methods: data analysis - gravitational lensing: micro - pulsars: general 



1 INTRODUCTION 

Computing resources are a fundamental tool in astronomy: 
they are used to acquire and reduce observational data, sim- 
ulate astrophysical processes, and analyse and visualise the 
results. Advances in the field of astronomy have depended 
heavily on the incre ase in computing power that has fol- 
lowed Moore's Law (|Moorelll965l ) since the mid 1960s; in- 
deed, many contemporary astronomy survey projects and 
astrophysics simulations would simply not be possible with- 
out the evolution Moore predicted. 

Until recently, increased computing power was deliv- 
ered in direct proportion to the increase in central process- 
ing unit (CPU) clock rates. Astronomy software executed 
more and more rapidly with each new hardware release, 
without any further programming work. But around 2005, 
the advance in clock rates ceased, and manufacturers turned 
to increasing the instantaneous processing capacity of their 
CPUs by including additional processing cores in a single 
silicon chip package. Today's mainstream multi-core CPUs 
typically have between 2 and 8 processing cores; these are 
routinely deployed in large-scale compute clusters. 

Fig. [T] places the mainstream CPUs from the last ~ 20 
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Figure 1. Clock-rate versus core-count phase space of Moore's 
Law binned every two years for CPUs (circles) and GPUs (dia- 
monds). There is a general trend for performance to increase from 
bottom left to top right. 
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years in the clock rate versus core-count phase space. In 
this space, the evolution of CPUs turns a 'corner' around 
2005 when clock rates plateaued and multi-core processors 
emerged. Lying above today's fastest multi-core CPUs in 
Fig. [T] though, are the contemporary graphics processing 
units (CPUs), boasting hundreds of cores (' many-cor€} and 
~ 1 Ghz clock rates. CPUs are already useful in their own 
right — providing ~ 30 times the raw computation speed of 
CPUs — but perhaps more interestingly, they represent the 
likely evolution of CPUs. CPUs demonstrate how comput- 
ing power can continue to follow Moore's Law in an era of 
zero (or even negative) growth in clock rates. 

The plateau in processor clock rates is problematic for 
astronomy software composed of sequential codes, wherein 
instructions are executed one after the other. Such codes 
derive no direct performance benefit from the presence of 
multiple processing cores, and their performance will lan- 
guish for as long as processor clock rates remain steady at 
~ 3-4 GHz. Astronomy software must be (re-)written to 
take advantage of many-core processors. 

Astronomers are already cogniscent of this issue. 
Shared- and distributed-memory multi-core CPU machines 
have been exploited using the well-known OpenMP and MPI 
programming model^J. In addition, a number of researchers 
have adapted, written and/or re-written classic astronomy 
codes for the GPU architecture in the last ~ 3 years, and 
gained performance improvements ranging from factors of a 
few to fact ors of several hundred . Some highlights include N- 
body (e.g.,|Hamada et a l .1120091) . radio-telescope signal cor- 
relati on ('e.g.. | Wavth et al.ll20090 . adaptive mesh refinement 



(e.g., Schive et al. 120100 ■ gala xy spectral energy distribu- 



tion ( Jonsson & Primack 2009) and gravitational microlens- 
ing IjThompson et alj|2010l ) codes. 

Inevitably, a section of the astronomy community will 
continue with an ad hoc approach to the adaptation of soft- 
ware from single-core to many-core architectures. In this pa- 
per, we demonstrate that there is a significant difference be- 
tween current computing techniques and those required to 
efficiently utilise new hardware architectures such as many- 
core processors, as exemplified by GPUs. These techniques 
will be unfamiliar to most astronomers and will pose a chal- 
lenge in terms of keeping our discipline at the forefront of 
computational science. We present a practical, effective and 
simple methodology for creating astronomy software whose 
performance scales well to present and future many-core ar- 
chitectures. Our methodology is grounded in the classical 
computer science field of algorithm analysis. 

In Section [2] we introduce the key concepts in algo- 
rithm analysis, with particular focus on the context of many- 
core architectures. We present four foundation algorithms, 
and characterise them as we outline our algorithm analy- 
sis methodology. In Section [3] we demonstrate the proposed 
methodology by applying it to four well-known astronomy 
problems, which we break down into their constituent foun- 
dation algorithms. We validate our analysis of these prob- 
lems against ad hoc many-core implementations as available 



1 While the specifics of parallel CPU systems lie outside the scope 
of this paper, we note that many of the algorithm analysis tech- 
niques we describe lend themselves equally well to these architec- 
tures. 



in the literature and discuss the implications of our approach 
for the future of computing in astronomy in Section [4] 



2 A STRATEGIC APPROACH: ALGORITHM 
ANALYSIS 

Algorithm analysis, pioneered by Donald Knuth (see, e.g., 
lKnuthlll9980 . is a fundamental component of computer sci- 
ence - a discipline that is more about how to solve problems 
than the actual implementation in code. In this work, we are 
not interested in the specifics (i.e., syntax) of implementing a 
given astronomy algorithm with a particular programming 
language or library (e.g., CUDA, OpenCL, Thrust) on a 
cho sen computin g architecture (e.g., GPU, Cell, FPGA). 
As Irlarrisl i|2007f ) notes, algorithm-level optimisations are 
much more important with respect to overall performance on 
many-core hardware (specifically GPUs) than implementa- 
tion optimisations, and should be made first. We will return 
to the issue of implementation in future work. 

Here we present an approach to tackling the transition 
to many-core hardware based on the analysis of algorithms. 
The purpose of this analysis is to determine the potential 
of a given algorithm for a many-core architecture before any 
code is written. This provides essential information about 
the optimal approach as well as the return on investment one 
might expect for the effort of (re-) implementing a particular 
algorith m. Ou r meth odology was in part inspired by the 
work of lHarrisI (|20050 . 

Work in a simila r vein has also been undertaken by 
lAsanovic et al.l (|2006l . 120090 who classified parallel algo- 
rithms into 12 groups, referring to them as 'dwarfs'. While 
insightful and opportune, these dwarfs consider a wide range 
of parallel architectures, cover all areas of computation (in- 
cluding several that are not of great relevance to astron- 
omy) and are limited as a resource by the coarse nature of 
the classification. In contrast, the approach presented here 
is tailored to the parallelism offered by many-core proces- 
sor architectures, contains algorithms that appear frequently 
within astronomy computations, and provides a fine-grained 
level of detail. Furthermore, our approach considers the fun- 
damental concerns raised by many-core architectures at a 
level of abstraction that avoids dealing with hardware or 
software-specific details and terminology. This is in contrast 
to the work by IChe et al.l ((2008) , who presented a useful 
but highly-targeted summary of general-purpose program- 
ming on the NVIDIA GPU architecture. 

For these reasons this work will serve as a valuable and 
practical resource for those wishing to analyse the expected 
performance of particular astronomy algorithms on current 
and future many-core architectures. 

For a given astronomy problem, our methodology is as 
follows: 

(i) Outline each step in the problem, 
(ii) Identify steps that resemble known algorithms (see 
below). 

(a) Outlined steps may need to be further decomposed 
into sub-steps before a known counterpart is recognised. 
Such composite steps may later be added to the collection 
of known algorithms. 
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(iii) For each identified algorithm, refer to its pre-existing 
analysis. 

(a) Where a particular step does not appear to match 
any known algorithm, refer to a relevant analysis method- 
ology to analyse the step as a custom algorithm (see Sec- 
tions [2TT][2]2] and [2T3j|. The newly-analysed algorithm can 
then be added to the collection for future reference. 

(iv) Once analysis results have been obtained for each 
step, apply a global analysis to the algorithm to obtain a 
complete picture of its behaviour (see Section \2A\ . 

Here we present a small collection of foundation algo- 
rithm^ that appear in computational astronomy problems. 
This is motivated by the fact that complex algorithms may 
be composed from simpler ones. We propose that algorithm 
composition provides an excellent approach to turning the 
multi-core corner. Here we focus on its application to al- 
gorithm analysis; in future work we will show how it may 
also be applied to implementation methodologies. The al- 
gorithms are described below using a vector data structure. 
This is a data structure like a Fortran or C array represent- 
ing a contiguous block of memory and providing constant- 
time random access to individual elemental. We use the no- 
tation v[i] to represent the i th element of a vector v. 

Transform: Returns a vector containing the result of 
the application of a specified function to every individual 
element of an input vector. 



out[i] = /(in[i]) 



(1) 



Functions of more than one variable may also be applied to 
multiple input vectors. Scaling the brightness of an image 
(defined as a vector of pixels) is an example of a transform 
operation. 

Reduce: Returns the sum of every element in a vector. 



out 






(2) 



Reductions may be generalised to use any associative bi- 
nary operator, e.g., product, min, max etc. Calculating im- 
age noise is a common application of the reduce algorithm. 
Gather: Retrieves values from an input vector accord- 
ing to a specified index mapping and writes them to an 
output vector. 



out[i] = in [map [i]] 



(3) 



Reading a shifted or transformed subregion of an image is a 
common example of a gather operation. 

Interact: For each element i of an input vector, ini, 
sums the interaction between i and each element j in a sec- 
ond input vector, 1112. 



out[i] =y^/(ini[i] 



"J-abD 



(4) 



where / is a given interaction function. The best-known ap- 
plication of this algorithm in astronomy is the computation 
of forces in a direct N-body simulation, where both input 
vectors represent the system's particles and the interaction 
function calculates the gravitational force between two par- 
ticles. 

These four algorithms were chosen from experience with 
a number of computational astronomy problems. The trans- 
form, reduce and gather operations may be referred to as 
'atoms' in the sense that they are indivisible operations. 
While the interact algorithm is technically a composition of 
transforms and reductions, it will be analysed as if it too 
was an atom, enabling rapid analysis of problems that use 
the interact algorithm without the need for further decom- 
position. 

We now describe a number of algorithm analysis tech- 
niques that we have found to be relevant to massively- 
parallel architectures. These techniques should be applied to 
the individual algorithms that comprise a complete problem 
in order to gain a detailed understanding of their behaviour. 

2.1 Principle characteristics 

Many-core architectures exhibit a number of characteristics 
that can impact strongly on the performance of an algo- 
rithm. Here we summarise four of the most important issues 
that must be considered. 

Massive parallelism: To fully utilise massively- 
parallel architectures, algorithms must exhibit a high level 
of parallel granularity, i.e., the number of required opera- 
tions that may be performed simultaneously must be large 
and scalable. Data-parallel algorithms, which divide their 
data between parallel processors rather than (or in addi- 
tion to) their tasks, exhibit parallelism that scales with the 
size of their input data, making them ideal candidates for 
massively-parallel architectures. However, performance may 
suffer when these algorithms are executed on sets of input 
data that are small relative to the number of processors in 
a particular many-core architecturqj. 

Memory access patterns: Many-core architectures 
contain very high bandwidth main memorj|j in order to 
'feed' the large number of parallel processing units. However, 
high latency (i.e., memory transfer startup) costs mean that 
performance depends strongly on the pattern in which mem- 
ory is accessed. In general, maintaining 'locality of refer- 
ence' (i.e., neighbouring threads accessing similar locations 
in memory) is vital to achieving good performance^. Fig. [2] 
illustrates different levels of locality of reference. 

Collisions between threads trying to read the same lo- 
cation in memory can also be costly, and write-collisions 
must be treated using expensive atomic operations in order 
to avoid conflicts between threads. 

Branching: Current many-core architectures rely on 



2 Note that for these algorithms we have used naming conven- 
tions that are familiar to us but are by no means unique in the 
literature. 

3 Here we use constant-time in the algorithmic sense, i.e., con- 
stant with respect to the size of the input data. In this context 
we are not concerned with hardware-specific performance factors. 



4 Note also that oversubscription of threads to processors is of- 
ten a requirement for good performance in many-core architec- 
tures. For example, an NVIDIA GT200-class GPU may be under- 
utilised with an allocation of fewer than ~ 10 4 parallel threads, 
corresponding to an oversubscription rate of around 50 X. 

5 Memory bandwidths on current GPUs are O(100GB/s). 

6 Locality of reference also affects performance on traditional 
CPU architectures, but to a lesser extent than on GPUs. 
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Figure 2. Representative memory access patterns indicating 
varying levels of locality of reference. Contiguous memory access 
is the optimal case for many-core architectures. Patterns with 
high locality will generally achieve good performance; those with 
low locality may incur severe performance penalties. 
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Figure 3. A schematic view of divergent execution within a 
SIMD architecture. Lines indicate the flow of instructions; white 
diamonds indicate branch points, where the code paths of neigh- 
bouring threads diverge. The statements on the left indicate 
typical corresponding source code. White space between branch 
points indicates a thread waiting for its neighbours to complete 
a divergent code section. 



single instruction multiple data (SIMD) hardware. This 
means that neighbouring threads that wish to execute dif- 
ferent instructions must wait for each other to complete the 
divergent code section before execution can continue in par- 
allel (see Fig. [3]). For this reason, algorithms that involve 
significant branching between different threads may suffer 
severe performance degradation. Similar to the effects of 
memory access locality, performance will in general depend 
on the locality of branching, i.e., the number of different 
code-paths taken by a group of neighbouring threads. 

Arithmetic intensity: Executing arithmetic instruc- 
tions is generally much faster than accessing memory on cur- 



rent many-core hardware. Algorithms performing few arith- 
metic operations per memory access may become memory- 
bandwidth-bound; i.e., their speed becomes limited by the 
rate at which memory can be accessed, rather than the rate 
at which arithmetic instructions can be processed. Memory 
bandwidths in many-core architectures are typically signifi- 
cantly higher than in CPUs, meaning that even bandwidth- 
bound algorithms may exhibit strong performance; however, 
they will not be able to take full advantage of the available 
computing power. In some cases, it may be beneficial to 
re-work an algorithm entirely in order to increase its arith- 
metic intensity, even at the cost of performing more numer- 
ical work in total. 

For the arithmetic intensities presented in this paper, 
we assume an idealised cache model in which only the first 
memory read of a particular piece of data is included in 
the count; subsequent or parallel reads of the same data are 
assumed to be made from a cache, and are not counted. 
The ability to achieve this behaviour in practice will depend 
strongly on the memory access pattern (specifically the lo- 
cality of memory accesses) . 



2.2 Complexity analysis 

The complexity of an algorithm is a formal measure of its 
execution time given a certain size of input. It is often used 
as a means of comparing the speeds of two different algo- 
rithms that compute the same (or a similar) result. Such 
comparisons are critical to understanding the relative con- 
tributions of different parts of a composite algorithm and 
identifying bottle-necks. 

Computational complexity is typically expressed as the 
total run-time, T, of an algorithm as a function of the input 
size, N, using 'Big O' notation. Thus T(N) - O(N) means 
a run-time that is proportional to the input size N. An al- 
gorithm with complexity of T(N) — 0(N 2 ) will take four 
times as long to run after a doubling of its input size. 

While the complexity measure is traditionally used for 
algorithms running on serial processors, it can be gener- 
alised to analyse parallel algorithms. One method is to in- 
troduce a second parameter: P, the number of processors. 
The run-time is then expressed as a function of both A^ and 
P. For example, an algorithm with a parallel complexity 
of T(N,P) = 0(§) will run P times faster on P proces- 
sors than on a single processor for a given input size; i.e., 
it exhibits perfect parallel scaling. More complex algorithms 
may incur overheads when run in parallel, e.g., those requir- 
ing communication between processors. In these cases, the 
parallel complexity will depend on the specifics of the target 
hardware architecture. 

An alternative way to express parallel complexity is us- 
ing the wo rk, W, and dep th, D, metrics first introduced 
formally by iBlellochl |l99fj). Here, work measures the total 
number of computational operations performed by an algo- 
rithm (or, equivalently, the run-time on a single processor), 
while depth measures the longest sequence of sequentially- 
dependent operations (or, equivalently, the run-time on an 
infinite number of processors). The depth metric is a mea- 
sure of the amount of inherent parallelism in the algo- 
rithm. A perfectly parallel algorithm has work complexity 
of W(N) = O(N) and depth complexity of D(N) = 0(1), 
meaning all but a constant number of operations may be 
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Table 1. Analysis of four foundation algorithms 
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performed in parallel. An algorithm with W = O(N) and 
D — O (log TV) is highly parallel, but contains some serial de- 
pendencies between operations that scale as a function of the 
input size. Parallel algorithms with work complexities equal 
to those of their serial counterparts are said to be 'work effi- 
cient'; those that further exhibit low depth complexities are 
considered to be efficient parallel algorithms. The benefit of 
the work/depth metrics over the parallel run-time is that 
they have no dependence on the particular parallel architec- 
ture on which the algorithm is executed, i.e., they measure 
properties inherent to the algorithm. 

A final consideration regarding parallel algorithms is 
Amdahl's law (Amdahl 1967), which states that the maxi- 
mum possible speedup over a serial algorithm is limited by 
the fraction of the parallel algorithm that cannot be (or 
simply is not) parallelised. Assuming an infinite number of 
available processors, the run-time of the parallel part of the 
algorithm will reduce to a constant, while the serial part will 
continue to scale with the size of the input. In terms of the 
work/depth metrics, the depth of the algorithm represents 
the fraction that cannot be parallelised, and the maximum 
theoretical speedup is given by S ma x ~ ^y. Note the im- 
plication that the maximum speedup is actually a function 
of the input size. Increasing the problem size in addition to 
the number of processors allows the speedup to scale more 
effectively. 



2.3 Analysis results 

We have applied the techniques discussed in Sections 12.11 
and 12.21 to the four foundation algorithms introduced at the 
beginning of Section [2] We use the following metrics: 

• Work and depth: The complexity metrics as de- 
scribed in Section \2. 21 

• Memory access locality: The nature of the memory 
access patterns as discussed in Section \2. II 

• Arithmetic intensity: Defined by the triple ratio 
r : w : f representing the number of read, write and func- 
tion evalation operations respectively that the algorithm 
performs (normalised to the input size). The symbol a is 
used, where applicable, to represent the internal arithmetic 
intensity of the function given to the algorithm. 

The results are presented in Table [T] Note that this analysis 
is based on the most-efficient known parallel version of each 
algorithm. 



2.4 Global analysis 

Once local analysis results have been obtained for each step 
of a problem, it is necessary to put them together and per- 
form a global analysis. Our methodology is as follows: 



(i) Determine the components of the algorithm where 
most of the computational work lies by comparing work 
complexities. Components with similar work complexities 
should receive similar attention with respect to parallelisa- 
tion in order to avoid leaving behind bottle-necks as a result 
of Amdahl's Law. 

(ii) Consider the amount of inherent parallelism in each 
algorithm 

by observing its theoretical speedup Smax ~ ^y- 

(iii) Use the theoretical arithmetic intensity of each al- 
gorithm to determine the likelihood of it being limited by 
memory bandwidth rather than instruction throughput. The 
theoretical global arithmetic intensity may be obtained by 
comparing the total amount of input and output data to the 
total amount of arithmetic work to be done in the problem. 

(iv) Assess the memory access patterns of each algorithm 
to identify the potential to achieve peak arithmetic inten- 

sit£|. 

(v) If particular components exhibit poor properties, con- 
sider alternative algorithms. 

(vi) Once a set of component algorithms with good the- 
oretical performance has been obtained, the algorithm de- 
composition should provide a good starting point for an im- 
plementation. 



3 APPLICATION TO ASTRONOMY 
ALGORITHMS 

We now apply our methodology from Section [2] to four typ- 
ical astronomy computations. In each case, we demonstrate 
how to identify the steps in an outline of the problem as 
foundation algorithms from our collection described at the 
beginning of Section[5] We then use this knowledge to study 
the exact nature of the available parallelism and determine 
the problem's overall suitability for many-core architectures. 
We note that we have deliberately chosen simple versions of 
the problems in order to maximise clarity and brevity in 
illustrating the principles of our algorithm analysis method- 
ology. 



3.1 Inverse ray-shooting gravitational lensing 

Introduction: Inverse ray-shooting is a numerical tech- 
nique used in gravitational microlensing. Light rays are pro- 
jected backwards (i.e., from the observer) through an en- 
semble of lenses and on to a source-plane pixel grid. The 
number of rays that fall into each pixel gives an indication of 
the magnification at that spatial position relative to the case 



7 Studying the memory access patterns will also help to identify 
the optimal caching strategy if this level of optimisation is desired. 
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where there was no microlensing. In cosmological scenarios, 
the resultant maps are used to study brightness variations in 
light curves of lensed quasars, providing constraints on the 
physical size of the accretion disk and broad line emission 
regions. 

The two main approaches to ray-shooting are based on 
either the direct calculation o f the gravitational deflection by 
each lens (|Kavser et al.lll986l ; fSchneider fc Weissll986l.ll98^ 



or the use o f a tree hierarchy of psuedo-lenses ( Wambsganssl 
1 19901 . 1 19991 ). Here, we consider the direct method. 
Outline: The ray-shooting algorithm is easily divided into 
a number of distinct steps: 

(i) Obtain a collection of lenses according to a desired 
distribution, where each lens has position and mass. 

(ii) Generate a collection of rays according to a uniform 
distribution within a specified 2D region, where each ray is 
defined by its position. 

(iii) For each ray, calculate and sum its deflection due to 
each lens. 

(iv) Add each ray's calculated deflection to its initial po- 
sition to obtain its deflected position. 

(v) Calculate the index of the pixel that each ray falls 
into. 

(vi) Count the number of rays that fall into each pixel. 

(vii) Output the list of pixels as the magnification map. 

Analysis: To begin the analysis, we interpret the above 
outline as follows: 

• Steps (i) and (ii) may be considered transform opera- 
tions that initialise the vectors of lenses and rays. 

• Step (iii) is an example of the interact algorithm, where 
the inputs are the vectors of rays and lenses and the inter- 
action function calculates the deflection of a ray due to the 
gravitational potential around a lens mass. 

• Steps (iv) and (v) apply further transforms to the col- 
lection of rays. 

• Step (vi) involves the generation of a histogram. As we 
have not already identified this algorithm in Section [2] it 
will be necessary to analyse this step as a unique algorithm. 

According to this analysis, three basic algorithms com- 
prise the complete technique: transform, interact and his- 
togram generation. Referring to Table [T] we see that, in the 
context of a lensing simulation using iV ra y S rays and Menses 
lenses, the amount of work performed by the transform and 
interact algorithms will be W = 0(N lays ) + 0(N\ enses ) and 
W = O^rays-ZVienses) respectively. 

We now analyse the histogram step. Considering first 
a serial algorithm for generating a histogram, where each 
point is considered in turn and the count in its correspond- 
ing bin is incremented, we find the work complexity to be 
W = O (Mays). Without further analysis, we compare this 
to those of the other component algorithms. The serial his- 
togram and the transform operations each perform similar 
work. The interact algorithm on the other hand must, as 
we have seen, perform work proportional to JV raya x N\ eriSeB . 
For large Menses (e.g., as occurs in cosmological microlens- 
ing simulations, where Ni cnscs > 10 4 ) this step will dominate 
the total work. Assuming the number of lenses is scaled with 
the amount of parallel hardware, the interact step will also 
dominate the total run-time. 

Given the dominance of the interact step, we now choose 



to ignore the effects of the other steps in the problem. It 
should be noted, however, that in contrast to cosmological 
microlensing, planetary microlensing models contain only a 
few lenses. In this case, the work performed by the interact 
step will be similar to that of the other steps, and thus the 
use of a serial histogram algorithm alongside parallel ver- 
sions of all other steps would result in a severe performance 
bottle-neck. Several parallel histogram algorithms exist, but 
a discussion of them is beyond the scope of this work. 

Returning to the analysis of the interact algorithm, we 
again refer to Table [T] Its worst-case depth complexity in- 
dicates a maximum speedup of S ~ W — 0(N lays ), i.e., 
parallel speedup scaling perfectly up to the number of rays. 
The arithmetic intensity of the algorithm scales as N\ enBBB 
and will thus be very high. Contiguous memory accesses 
indicate strong potential to achieve this high arithmetic in- 
tensity. We conclude that direct inverse ray-shooting for cos- 
mological microlensing is an ideal candidate for an efficient 
implementation on a many-core architecture. 



3.2 Hogbom CLEAN 

Introduction: Raw ('dirty') images produced by radio in- 
terferometers exhibit unwanted artefacts as the result of the 
incomplete sampling of the visibility plane. These artefacts 
can inhibit image analysis and should ideally be removed by 
deconvolution. Several different techniques have_been devel- 
oped to 'clean' these images. For a review, see iBriggsl l|199a) . 
He re we ana l yse th e image-based algorithm first des cribed 
by iHogboml \l974 ). We note that the algorithm by IClarkl 
( 1980) is now the more popular choice in the astronomy com- 
munity, but point out that it is essentially an approximation 
to Hogbom's algorithm that provides increased performance 
at the cost of reduced accuracy. 

The algorithm involves iteratively finding the brightest 
point in the 'dirty image' and subtracting from the dirty 
image an image of the beam centred on and scaled by this 
brightest point. The procedure continues until the brightest 
point in the image falls below a prescribed threshold. While 
the iterative procedure must be performed sequentially, the 
computations within each iteration step are performed in- 
dependently for every pixel of the images, suggesting a sub- 
stantial level of parallelism. The output of the algorithm is 
a series of 'clean components', which may be used to recon- 
struct a cleaned image. 

Outline: The algorithm may be divided into the following 
steps: 



(i) Obtain the beam image. 

(ii) Obtain the image to be cleaned. 

(iii) Find the brightest point, b, the standard deviation, 
cr, and the mean, fi, of the image. 

(iv) If the brightness of b is less than a prescribed thresh- 
old (e.g., \b — fi\ < 3<t), go to step (ix). 

(v) Scale the beam image by a fraction (referred to as the 
'loop gain') of the brightness of b. 

(vi) Shift the beam image to centre it over b. 

(vii) Subtract the scaled, shifted beam image from the 
input image to produce a partially-cleaned image. 

(viii) Repeat from step (iii). 

(ix) Output the 'clean components'. 



© 2010 RAS, MNRAS OOO.rflfTol 



Analysing Astronomy Algorithms for GPUs and Beyond 7 



Analysis: We decompose the outline of the Hogbom CLEAN 
algorithm as follows: 

• Steps (i) and (ii) are simple data-loading operations, 
and may be thought of as transforms. 

• Step (iii) involves a number of reduce operations over 
the pixels in the dirty image. 

• Step (v) is a transform operation, where each pixel in 
the beam is multiplied by a scale factor. 

• Step (vi) may be achieved in two ways, either by directly 
reading an offset subset of the beam pixels, or by switching 
to the Fourier domain and exploiting the shift theorem. Here 
we will only consider the former option, which we identify 
as a gather operation. 

• Step (vii) is a transform operation over pixels in the 
dirty image. 

We thus identify three basic algorithms in Hogbom 
CLEAN: transform, reduce and gather. Table[T]shows that the 
work performed by each of these algorithms will be compa- 
rable (assuming the input and beam images are of similar 
pixel resolutions) . This suggests that any acceleration should 
be applied equally to all of the steps in order to avoid the 
creation of bottle-necks. 

The depth complexities of each algorithm indicate a 
limiting speed-up of Smscx ~ Of-, — p * ls ) during the reduce 

log JVp X l s 

operations. While not quite ideal, this is still a good result. 
Further, the algorithms do not exhibit high arithmetic inten- 
sity (the calculations involving only a few subtractions and 
multiplies) and are thus likely to be bandwidth-bound. This 
will dominate any effect the limiting speed-up may have. 

The efficiency with which the algorithm will use the 
available memory bandwidth will depend on the memory 
access patterns. The transform and reduce algorithms both 
make contiguous memory accesses, and will thus achieve 
peak bandwidth. The gather operation in step (vi), where 
the beam image is shifted to centre it on a point in the in- 
put image, will access memory in an offset but contiguous 
2-dimensional block. This 2D locality suggests the potential 
to achieve near-peak memory throughput. 

We conclude that the Hogbom CLEAN algorithm rep- 
resents a good candidate for implementation on many-core 
hardware, but will likely be bound by the available memory 
bandwidth rather than arithmetic computing performance. 



3.3 Volume rendering 

Introduction: There are a number of sources of volume 
data in astronomy, including spectral cubes from radio tele- 
scopes and integral field units, as well as simulations us- 
ing adaptive mesh refinement and smoothed particle hydro- 
dynamics techniques. Visualising these data in physically- 
meaningful ways is important as an analysis tool, but even 
small volumes (e.g., 256 3 ) require large amounts of comput- 
ing power to render, particularly when real-time interactiv- 
ity is desired. 

Several methods exist for rendering volume data; here 
we analyse a direct (or brute-force) ray-casting algorithm 
i|LevovHl990r ). While similarities exist between ray-shooting 
for microlensing (Section 13. l[l and the volume rendering 
technique we describe here, they are fundamentally different 
algorithms. 



Outline: The algorithm may be divided into the following 
steps: 

(i) Obtain the input data cube. 

(ii) Create a 2D grid of output pixels to be displayed. 

(iii) Generate a corresponding grid of rays, where each is 
defined by a position (initially the centre of the correspond- 
ing pixel), a direction (defined by the viewing transforma- 
tion) and a colour (initially black). 

(iv) Project each ray a small distance (the step size) along 
its direction. 

(v) Determine which voxel each ray now resides in. 

(vi) Retrieve the colour of the voxel from the data volume. 

(vii) Use a specified transfer function to combine the 
voxel colour with the current ray colour. 

(viii) Repeat from step (iv) until all rays exit the data 
volume. 

(ix) Output the final ray colours as the rendered image. 

Analysis: We interpret the steps in the above outline as 
follows: 

• Steps (ii) to (v) and (vii) are all transform operations. 

• Step (vi) is a gather operation. 

All steps perform work scaling with the number of out- 
put pixels, iVp X i s , indicating there are no algorithmic bottle- 
necks and thus acceleration should be applied to the whole 
algorithm equally. 

Given that the number of output pixels is likely to be 
large and scalable, we should expect the transforms and the 
gather, with their 0(1) depth complexities, to parallelise 
perfectly on many-core hardware. 

The outer loop of the algorithm, which marches rays 
through the volume until they leave its bounds, involves 
some branching as different rays traverse thicker or thinner 
parts of the arbitrarily-oriented cube. This will have a nega- 
tive impact on the performance of the algorithm on a SIMD 
architecture like a GPU. However, if rays are ordered in such 
a way as to maintain 2D locality between their positions, 
neighbouring threads will traverse similar depths through 
the data cube, resulting in little divergence in their branch 
paths and thus good performance on SIMD architectures. 

The arithmetic intensity of each of the steps will typ- 
ically be low (common transfer functions can be as simple 
as taking the average or maximum), while the complete al- 
gorithm requires 0(N p ^i s Nd) memory reads, 0(iVp X i s ) mem- 
ory writes and 0(N px i s Nd) function evaluations for an input 
data volume of side length Nd ■ This global arithmetic inten- 
sity of Nd : 1 : NdOt indicates the algorithm is likely to 
remain bandwidth-bound. 

The use of bandwidth will depend primarily on the 
memory access patterns in the gather step (the transform 
operations perform ideal contiguous memory accesses) . Dur- 
ing each iteration of the algorithm, the rays will access an 
arbitrarily oriented plane of voxels within the data volume. 
Such a pattern exhibits 3D spatial locality, presenting an 
opportunity to cache the memory reads effectively and thus 
obtain near-peak bandwidth. 

We conclude that the direct ray-casting volume render- 
ing algorithm is a good candidate for efficient implemen- 
tation on many-core hardware, although, in the absence of 
transfer functions with significant arithmetic intensity, the 
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algorithm is likely to remain limited by the available mem- 
ory bandwidth. 

3.4 Pulsar time-series dedispersion 

Introduction: Radio-telescopes observing pulsars produce 
time-series data containing the pulse signal. Due to its pas- 
sage through the interstellar medium, the pulse signature 
gets delayed as a function of frequency, resulting in a 'dis- 
persing' of the data. The signal can be 'dedispersed' by as- 
suming a frequency-dependent delay before summing the 
signals at each frequency. The data are dedispersed using 
a number of trial dispersion measures (DMs), from which 
the true DM of the signal is measured. 

There are two principle dedispersion algorithms used in 
the literature : the direct algorithm and the tree algorithm 
i|Tavlorlll974r) . Here we consider the direct method, which 
simply involves delaying and summing time series for a range 
of DMs. The calculation for each DM is entirely indepen- 
dent, presenting an immediate opportunity for parallelisa- 
tion. Further, each sample in the time series is operated-on 
individually, hinting at additional fine-grained parallelism. 
Outline: Here we describe the key steps of the algorithm: 

(i) Obtain a set of input time series, one per frequency 
channel. 

(ii) If necessary, transpose the input data to place it into 
channel-major order. 

(iii) Impose a time delay on each channel by offsetting its 
starting location by the number of samples corresponding 
to the delay. The delay introduced into each channel is a 
quadratic function of its frequency and a linear function of 
the dispersion measure. 

(iv) Sum aligned samples across every channel to produce 
a single accumulated time series. 

(v) Output the result and repeat (potentially in parallel) 
from step (iii) for each desired trial DM. 

Analysis: We interpret the above outline of the direct dedis- 
persion algorithm as follows: 

• Step (ii) involves transposing the data, which is a form 
of gather. 

• Step (iii) may be considered a set of gather operations 
that shift the reading location of samples in each channel by 
an offset. 

• Step (iv) involves the summation of many time series. 
This is a nested operation, and may be interpreted as either 
a transform, where the operation is to sum the time sample 
in each channel, or a reduce, where the operation is to sum 
whole time series. 

The algorithm therefore involves gather operations in 
addition to nested transforms and reductions. For data con- 
sisting of N s samples for each of N c channels, each step of 
the computation operates on all 0(N s N c ) total samples. Ac- 
celeration should thus be applied equally to all parts of the 
algorithm. 

According to the depth complexity listed in Table[T] the 
gather operation will parallelise perfectly. The nested trans- 
form and reduce calculation may be parallelised in three 
possible ways: a) by parallelising the transform, where N s 
parallel threads each compute the sum of a single time sam- 
ple over every channel sequentially; b) by parallelising the 



reduce, where N c parallel threads cooperate to sum each 
time sample in turn; or c) by parallelising both the transform 
and the reduce, where N„ x N c parallel threads cooperate 
to complete the entire computation in parallel. 

Analysing these three options, we see that they have 
depth complexities of 0(N C ), 0(N s logN c ) and C(logiV c ) 
respectively. Option (c) would appear to provide the great- 
est speedup; however, it relies on using significantly more 
parallel processors than the other options. It will in fact 
only be the better choice in the case where the number of 
available parallel processors is much greater than N 3 . For 
hardware with fewer than N s parallel processors, option (a) 
will likely prove the better choice, as it is expected to scale 
perfectly up to N s parallel threads, as opposed to the less 
efficient scaling of option (c). In practice, the number of time 
samples N 3 will generally far exceed the number of paral- 
lel processors, and thus the algorithm can be expected to 
exhibit excellent parallel scaling using option (a). 

Turning now to the arithmetic intensity, we observe that 
the computation of a single trial DM involves only an ad- 
dition for each of the N 3 x N c total samples. This suggests 
the algorithm will be limited by memory bandwidth. How- 
ever, this does not take into account the fact that we wish to 
compute many trial dispersion measures. The computation 
of Num trial DMs still requires only 0(N S x N c ) memory 
reads and writes, but performs ./Vdm xN a xN c addition oper- 
ations. The theoretical global arithmetic intensity is there- 
fore 1:1: -/Vdm. Given a typical number of trial DMs of 
0(100), we conclude that the algorithm could, in theory at 
least, make efficient use of all available arithmetic processing 
power. 

The ability to achieve such a high arithmetic intensity 
will depend on the ability to keep data in fast memory for 
the duration of many arithmetic calculations (i.e., the abil- 
ity to efficiently cache the data). This in turn will depend on 
the memory access patterns. We note that in general, similar 
trial DMs will need to access similar areas of memory; i.e., 
the problem exhibits some locality of reference. The exact 
memory access pattern is non-trivial though, and a discus- 
sion of these details is outside the scope of this work. 

We conclude that the pulsar dedispersion algorithm 
would likely perform to a high efficiency on a many-core 
architecture. While it is apparent that some locality of ref- 
erence exists within the algorithm's memory accesses, opti- 
mal arithmetic intensity is unlikely to be observed without 
a thorough and problem-specific analysis of the memory ac- 
cess patterns. 



4 DISCUSSION 

The direct inverse ra y-shooting metho d has been imple- 
mented on a GPU by iThompson et all i^Old ). They sim- 
ulated systems with up to 10 9 lenses. Using a single GPU, 
they parallelised the interaction step of the problem and ob- 
tained a speedup of O(100x) relative to a single CPU core 
- a result consistent with the relative peak floating-point 
performance of the two processing unit£j. These results val- 



8 We note that IThompson et al.l J201(t) did not use the CPU's 
Streaming SIMD Extensions, which have the potential to provide 
a speed increase of up to 4x. However, our conclusion regarding 
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idate our conclusion that the inverse ray-shooting algorithm 
is very well suited to many-core architectures like GPUs. 

Our conclusions regarding the pulsar dedispersion algo- 
rithm are validated by a preliminary GPU implementation 
we have written. With only a simplistic approach to mem- 
ory caching, we have recorded a speedup of 15 x over an 
efficient multi-core CPU code. This result is in line with the 
relative peak memory bandwidth of the two architectures, 
supporting the conclusions of Section 13.41 that, without a 
detailed investigation into the memory access patterns, the 
problem will remain bandwidth-bound. 

Some astronomy problems are well-suited to a many- 
core architecture, others are not. It is important to know 
how to distinguish between these. In the astronomy com- 
munity, the majority of work with many-core hardware to 
date has focused on the implementation or porting of specific 
codes perhaps best classified as 'low-hanging fruit'. Not sur- 
prisingly, these codes have achieved significant speed-ups, 
in line with the raw performance benefits offered by their 
target hardware. 

A more generalised use of 'novel' computi ng architec- 
tures was undertaken by iBrunner et al.l l|2007l ). who, as a 
case study, implemented the two-point angular correlation 
function for cosmological galaxy clustering on two differ- 
ent FPGA architectureCl- While they successfully commu- 
nicated the advantages offered by these new technologies, 
their focus on implementation details for their FPGA hard- 
ware inhibits the ability to generalise their findings to other 
architectures. 

It is interesting to note that previous work has in fact 
identified a number of common concerns with respect to 
GPU implementations of astronomy algorithms. For exam- 
ple, the issues of optimal use of the memory hierarchy and 
underuse of available hardware for small particle counts have 
been discussed in the conte xt of the direct N-body problem 
(e.g.. iBelleman et al.ll2008l ). These concerns essentially cor- 
respond to a combination of what we have referred to as 
memory access patterns, arithmetic intensity and massive 
parallelism. While originally being discussed as implemen- 
tation issues specific to particular choices of software and 
hardware, our abstractions re-cast them at the algorithm 
level, and allow us to consider their impact across a variety 
of problems and hardware architectures. 

Using algorithm analysis techniques, we now have a ba- 
sis for understanding which astronomy algorithms will bene- 
fit most from many-core processors. Those with well-defined 
memory access patterns and high arithmetic intensity stand 
to receive the greatest performance boost, while problems 
that involve a significant amount of decision-making may 
struggle to take advantage of the available processing power. 

For some astronomy problems, it may be important 
to look beyond the techniques currently in use, as these 
will have been developed (and optimised) with traditional 
CPU architectures in mind. Avenues of research could in- 
clude, for instance, using higher-order numerical schemes 



the efficiency of the algorithm on the GPU remains unchanged 
by this fact. 

9 Field Programmable Gate Arrays are another hardware archi- 
tecture exhibiting significant fine-grained parallelism, but their 
specific details lie outside the scope of this paper. 



l|Nitadori &: Making 120081 ) or choosing simplicity over effi- 
ciency by using brute-force methods (Bate et al. submitted) . 
Some algorithms, such as histogram generation, do not have 
a single obvious parallel implementation, and may require 
problem-specific input during the analysis process. 

In this work, we have discussed the future of astrononry 
computation, highlighting the change to many-core process- 
ing that is likely to occur in CPUs. 

The shift in commodity hardware from serial to parallel 
processing units will fundamentally change the landscape 
of computing. While the market is already populated with 
multi-core chips, it is likely that chip designs will undergo 
further significant changes in the coming years. We believe 
that for astronomy, a generalised methodology based on the 
analysis of algorithms is a prudent approach to confronting 
these changes - one that will continue to be applicable across 
the range of hardware architectures likely to appear in the 
coming years: CPUs, GPUs and beyond. 
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